Highlighting uncertainty in clinical risk prediction using a model of emergency laparotomy mortality risk

Clinical prediction models typically make point estimates of risk. However, values of key variables are often missing during model development or at prediction time, meaning that the point estimates mask significant uncertainty and can lead to over-confident decision making. We present a model of mortality risk in emergency laparotomy which instead presents a distribution of predicted risks, highlighting the uncertainty over the risk of death with an intuitive visualisation. We developed and validated our model using data from 127134 emergency laparotomies from patients in England and Wales during 2013–2019. We captured the uncertainty arising from missing data using multiple imputation, allowing prospective, patient-specific imputation for variables that were frequently missing. Prospective imputation allows early prognostication in patients where these variables are not yet measured, accounting for the additional uncertainty this induces. Our model showed good discrimination and calibration (95% confidence intervals: Brier score 0.071–0.078, C statistic 0.859–0.873, calibration error 0.031–0.059) on unseen data from 37 hospitals, consistently improving upon the current gold-standard model. The dispersion of the predicted risks varied significantly between patients and increased where prospective imputation occurred. We present a case study that illustrates the potential impact of uncertainty quantification on clinical decision making. Our model improves mortality risk prediction in emergency laparotomy and has the potential to inform decision-makers and assist discussions with patients and their families. Our analysis code was robustly developed and is publicly available for easy replication of our study and adaptation to predicting other outcomes.


Supplementary Note A: Data preparation
Implausible values for continuous preoperative variables were redacted as shown in Supplementary Table 1. Bivariate scatterplots of preoperative blood urea nitrogen (BUN) and creatinine values suggested that, in a small minority of cases, creatinine values had erroneously been recorded as BUN values and vice versa. We addressed this by redacting both BUN and creatinine values for 222 cases where these values were the same, and for 199 cases where BUN was greater than 84 mg dL -1 and also greater than 2 × creatinine. Supplementary Figure 1 shows the effect of this policy.
The NELA data collection proforma allows users to select one or more of 27 surgical indications, some of which are rarely chosen. Whilst ≤2 indications are selected for most cases, a minority have a large number of chosen indications. Including all indications in RUNE, in any combination, would be overwhelming for users and create many rarely-used categories. We therefore opted to consolidate the indications to only those that were chosen most commonly, and to limit each case to one indication.
Common single indications were defined as those chosen solely in ≥1200 of cases. The most commonly-chosen pairs and trios of indications were manually consolidated to the most similar common single indication. Remaining cases were defined as 'Other indication' or as missing. Missing indications were multiply imputed as described in the main manuscript. Full details of the indication consolidation process are shown in Supplementary Table 2.

Supplementary Note B: Selection from RUNE candidate covariates
The 26 candidate covariates for RUNE were: age, sex, American Society of Anesthesiologists (ASA) physical status, cardiovascular status, respiratory status, Glasgow coma score, whether CT scan had been performed, predicted degree of peritoneal soiling, presence of malignancy, surgical indication, surgical urgency, number of surgeries within 30 days, severity of the intended surgery, predicted blood loss, heart rate, presence of arrhythmia, systolic arterial pressure, white cell count, haemoglobin, C-reactive protein, sodium, potassium, creatinine, blood urea nitrogen (BUN), lactate and albumin.
Missing values were imputed as described in the main manuscript and Supplementary Note C, except that lactate and albumin were imputed in the same way as the other continuous variables (linear regression followed by predictive mean matching) rather than using customised GAMs. To limit the computational intensity of the analysis, which was increased by the cross validation described below, we performed ten rounds of multiple imputation rather than the larger number used in the main analysis.
Several variables (heart rate, presence of arrhythmia, BUN, creatinine, cardiovascular status, respiratory status, whether CT scan had been performed, predicted degree of peritoneal soiling, presence of malignancy, surgical indication) were used in prespecified tensor-product interaction terms. These terms were considered a priori important and their constituent variables were therefore not considered for backward elimination.
Seven of the remaining 16 candidate covariates were removed via backward elimination 1 . This process used only the development data from the first development-validation split. During this process, the mutual information of the covariates may result in weak association with mortality risk for some covariates despite a strong association with mortality risk in univariable modelling.
A version of RUNE containing all the candidate covariates (referred to as full RUNE) was specified, and its regularisation terms were tuned, as described in the main manuscript. In each of the seven rounds of backward elimination, each of the remaining candidate covariates was considered for elimination as follows: 1. Remove the candidate covariate under consideration (plus any previously-eliminated covariates) from full RUNE, yielding candidate RUNE. 2. Split each of the ten datasets from multiple imputation into train and test subsets, in a 2:1 ratio. 3. Using the ten train subsets, fit ten versions of candidate RUNE then combine them using Rubin's rules as described in Supplementary Note C, yielding combined candidate RUNE. 4. Calculate the log loss of combined candidate RUNE on each of the ten test subsets. 5. Repeat steps 2-4 twice more, i.e. perform three-fold cross validation overall. 6. Calculate the mean of the 30 log losses obtained overall.
The candidate covariate which yields the lowest mean log loss (i.e. which diminishes predictive performance the least) in that round is then eliminated.

Supplementary Note C: Missing data imputation
The different imputation models used in our analysis necessitated that two rounds of multiple imputation were conducted in series whilst fitting RUNE: In the first round, variables apart from lactate and albumin were imputed, whilst lactate and albumin alone were imputed in the second. The minimum number of imputations required for the jth round was determined as mj = ceiling(fj ⨉ 100), where fj is the fraction of cases that are incomplete for any variable being imputed in round j ∈ {1, 2}. In general, fj is at least as large as the fraction of missing information 2 . m1 imputations were performed in round one. The number of imputations round two was the smallest integer greater than or equal to m2, that was also a multiple of m1.
MICE for binary and continuous variables (apart from lactate and albumin) used logistic and linear regression respectively, without transformation of the imputation model covariates, followed by predictive mean matching. In predictive mean matching, the output of the regression model is compared to the observed values for the variable being imputed, and its 20 nearest observed neighbours (by Euclidean distance) are found. The imputed value is sampled from a discrete uniform distribution over these nearest neighbours.
Different imputed values were obtained at each MICE iteration by setting the imputation models' coefficients to samples from their approximate posterior distributions. For each categorical variable, a categorical distribution was obtained using multinomial regression on the MICE output (again, without transformation of the imputation model covariates), then the variable was multiple imputed by repeated sampling from this distribution.
To validate our treatment of lactate and albumin as missing at random, we derived lactate and albumin missingness-indicator variables for use as RUNE covariates. We hypothesised that, if lactate or albumin were more likely to be missing in patients who were less unwell, and the other RUNE covariates failed to fully quantify this wellness, then the missingness-indicator variables would be significantly associated with mortality risk. After fitting RUNE as described in the main manuscript, we calculated p-values for the missingness indicator terms. The null hypothesis for each term was that its coefficient was zero, after compensating for bias in the other model terms 3 . P-values (1 degree of freedom) for both missingness indicators were 1.0 in every cross-validation fold, validating our treatment of lactate and albumin as missing at random.
GAMs (for lactate and albumin imputation, and RUNE itself) fit on the complete datasets output from multiple imputation were combined using Rubin's rules 2 . Specifically, given m complete datasets from multiple imputation, we fit m GAMs with approximate posterior distributions {N(μi, Σi) : 1 ≤ i ≤ m} over their coefficients. The combined GAM has coefficients with mean * = 1 ∑ =1 and covariance Σ* = (1 .

Supplementary Note D: Evaluating model calibration
We obtained smooth calibration curves for the re-fitted NELA Calculator and RUNE as follows: Given the validation-set mortality labels y and corresponding predicted mortality risks ŷ, we use a GAM to model yi ~ Bernoulli(p=σ(f(ŷi) + intercept)), where f is a spline transformation using 5 second-degree polynomial splines with a smoothing penalty on their second derivative, and σ is the logistic function. Given a predicted mortality risk, this GAM thereby estimates the corresponding true mortality risk. We fit a separate calibration GAM on each of the 120 validation sets, and obtain a smooth calibration curve for each by evaluating the fitted GAM over the interval [0, 1].
For the re-fitted NELA Calculator, we choose the smoothing penalty which produced the best fit amongst 50 candidates logarithmically spaced between 10 -3 and 10 3 . The calibration GAM fitting for RUNE was too computationally intensive (as the fitting process uses 420 predicted risks for each case in the validation set ≈ 10,000,000 risks) to repeat for each of 50 candidate smoothing penalties, so a two-stage smoothing penalty selection process was adopted: we derived point predictions of mortality risk from RUNE by taking the median of each predicted risk distribution, and fit calibration GAMs on these using the 50 candidate smoothing penalties described above. We found the three smoothing penalties most commonly selected in this process, and used them as candidates in the final calibration GAM fitting for RUNE.

Supplementary Note E: Performance of the albumin and lactate imputation sub-models
The imputation sub-models explained only a limited proportion of the observed variance in albumin and lactate.
In the main-analysis imputation models which excluded mortality as a covariate, median (95% CI) R 2 was 0.22 (0.19 -0.25) for the albumin imputation model and 0.22 (0.20 -0.25) for the lactate imputation model. In the sensitivity-analysis imputation models which included mortality as a covariate, median (95% CI) R 2 was 0.23 (0.20 -0.25) for the albumin imputation model and 0.24 (0.22 -0.27) for the lactate imputation model. Partial dependence plots for each input variable in the main-analysis imputation models are shown in Supplementary  Figures 2 and 3.

Supplementary Note F: Sensitivity to mortality covariate in lactate and albumin imputation models
As shown in Supplementary Figure 4, excluding mortality as a covariate in the lactate and albumin imputation models caused RUNE to underestimate lactate and albumin's association with mortality risk. This effect was more marked for albumin, which may result from its higher proportion of missing values.

Supplementary Note G: Sensitivity to excluding albumin as a RUNE covariate
Supplementary Table 3 compares the performance of RUNE with the sensitivity analysis version of RUNE that excludes albumin as a covariate ('No-albumin RUNE'). Excluding albumin produced a small decrease in model discrimination with no discernible change in calibration. This reduction in predictive performance, in comparison to the main analysis, supports our including albumin as a RUNE covariate despite not being able to model its changing missingness over time.
Supplementary Figure 5 shows, for no-albumin RUNE, the contribution of each variable to predicted mortality risk if the other variables are held fixed. The relationships of the remaining covariates with mortality risk are unchanged compared with those observed in Figure 2 of the main manuscript. This supports our not including interaction terms between albumin and the other covariates in RUNE. Supplementary Figure 4. Spline terms for lactate and albumin in the novel mortality risk model, both for the original (main analysis) version of RUNE which excludes mortality as a covariate in lactate and albumin imputation sub-models, and the re-fitted (sensitivity analysis) version of RUNE which includes mortality as a covariate in lactate and albumin imputation sub-models. Blue and orange bands show changes in predicted mortality risk. The band's width for each variable is proportional to the uncertainty of its association with mortality risk. From lightest to darkest, the 4 overlapping bands represent 95%, 70%, 45% and 20% confidence.